Document outline

This document provides the code used during the data processing stages of the Brand, Hay, Clark, Watson and Sóskuthy (2020) manuscript, submitted to the Journal of Phonetics. It contains all the steps the authors carried out in order to produce the final data set used in the analysis (see supplementary materials: Analysis file).

Data

Purpose: Install libraries and load data

  • The raw data file being processed contains data extracted from the ONZE corpus using LaBB-CAT (Fromont & Hay, 2008)

  • The data were processed automatically with Praat and were force-aligned using the HTK toolkit

  • The ONZE corpus comprises 4 sub-corpora: Mobile Unit (MU), Intermediate Archive (IA), Darfield (Darfield), Canterbury Corpus (CC)

  • All available speakers were queried and all available tokens containing the following vowel segments were extracted (FLEECE, KIT, DRESS, TRAP, SCHWA, START, LOT, THOUGHT, NURSE, STRUT, FOOT, GOOSE)

  • The resulting raw unprocessed dataset can be found in ONZE_raw.rds

First, we will load in the relevant R libraries required to run the processing procedures and then load in the data itself.

For the code in this document to work, the libraries must be installed and loaded into your R session. If you do not have any of the libraries installed, you can run install.packages("LIBRARY NAME") (changing "LIBRARY NAME" to the required library name, e.g. install.packages("tidyverse")), which should resolve any warning messages you might get.
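If you prefer, the check-and-install step for all four libraries can be done in one go. The following is a convenience sketch, not part of the original pipeline; it uses only base R functions (`installed.packages()`, `install.packages()`):

```r
#install any of the required libraries that are not already present
required <- c("tidyverse", "tidylog", "ggforce", "cowplot")
missing <- required[!required %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```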

#load in the libraries
library(tidyverse) #various functions
library(tidylog) #gives summaries
library(ggforce) #plotting functions
library(cowplot) #plotting functions

#load in the data
vowels_all_raw <- readRDS("Filtering_data/ONZE_raw.rds")

Initial tidying

Purpose: Make the data easier to use

Several initial tidying steps are applied to the raw data for ease of use; see the comments (lines preceded by a #) for information about what the code is doing. There is no direct modification of the actual data values, but new variables are created and some variables are removed.

vowels_all <- vowels_all_raw %>%
  
  #rename the variables
  rename(TokenNum = Number,
         Gender = participant_gender,
         PrevWord = `Before Match`,
         FollWord = `After Match`,
         Word = `Target orthography`,
         WordSegments = `Match segments`,
         VowelDISC = `Target segments`,
         VowelStart = `Target segments start`,
         VowelEnd = `Target segments end`,
         VowelMid = time_0.5,
         F1_50 = `F1-time_0.5`,
         F2_50 = `F2-time_0.5`,
         F3_50 = `F3-time_0.5`) %>%
  
  #create a Wells lexical set variable
  mutate(Vowel = fct_recode(factor(VowelDISC),
                            FLEECE = "i",
                            KIT = "I",
                            DRESS = "E",
                            TRAP = "{",
                            SCHWA = "@",
                            START = "#",
                            LOT = "Q",
                            THOUGHT = "$",
                            NURSE = "3",
                            STRUT = "V",
                            FOOT = "U",
                            GOOSE = "u"),
         
         #recode any gender values read in as FALSE as female
         Gender = fct_recode(factor(Gender), "F" = "FALSE"),
         
         #create a vowel duration variable
         VowelDur = VowelEnd - VowelStart,
         
         #create an orthographic length variable
         orthographic_length = str_length(Word),
         
         #create a phonological length variable
         phonological_length = str_length(WordSegments)
         
         ) %>%
  
  #change all factors to characters
  mutate_if(sapply(., is.factor), as.character) %>%
  
  #keep only the variables which are informative and drop ones that are not
  select(TokenNum, Speaker, Transcript, Corpus, Gender, participant_year_of_birth, Line, LineEnd, MatchId, TargetId, URL, PrevWord, Text, FollWord, Word, WordSegments, Vowel, VowelDISC, F1_50, F2_50, VowelStart, VowelMid, VowelEnd, VowelDur, Error, orthographic_length, phonological_length)
## mutate: changed 200,000 values (10%) of 'Gender' (0 new NA)
##         new variable 'Vowel' with 49 unique values and 0% NA
##         new variable 'VowelDur' with 101,088 unique values and <1% NA
##         new variable 'orthographic_length' with 29 unique values and <1% NA
##         new variable 'phonological_length' with 37 unique values and 0% NA
## mutate_if: converted 'SearchName' from factor to character (0 new NA)
##            converted 'Transcript' from factor to character (0 new NA)
##            converted 'Speaker' from factor to character (0 new NA)
##            converted 'Corpus' from factor to character (0 new NA)
##            converted 'Gender' from factor to character (0 new NA)
##            converted 'MatchId' from factor to character (0 new NA)
##            converted 'TargetId' from factor to character (0 new NA)
##            converted 'URL' from factor to character (0 new NA)
##            converted 'PrevWord' from factor to character (0 new NA)
##            converted 'Text' from factor to character (0 new NA)
##            converted 'FollWord' from factor to character (0 new NA)
##            converted 'Word' from factor to character (0 new NA)
##            converted 'WordSegments' from factor to character (0 new NA)
##            converted 'VowelDISC' from factor to character (0 new NA)
##            converted 'F1_50' from factor to character (0 new NA)
##            converted 'F2_50' from factor to character (0 new NA)
##            converted 'F3_50' from factor to character (0 new NA)
##            converted 'Error' from factor to character (0 new NA)
##            converted 'Vowel' from factor to character (0 new NA)
## select: dropped 2 variables (SearchName, F3_50)

Data filtering

Purpose: filter out unsuitable data

As the raw data file contains various data points that are not suitable for our final analysis, we will filter them out from the dataset. The following steps are taken:

  • Remove speakers with missing gender or year of birth information (this is required for modelling)

  • Remove transcripts that are word lists (the data should be interview speech)

  • Remove tokens where Praat was not able to extract F1/F2 (these are non-numeric values)

  • Remove tokens where F1 is > 1000 (these are likely errors)

  • Remove tokens with vowel durations <= 0.01 or >= 3 (these are likely errors)

  • Remove tokens with a phonological length > 25 (these are likely errors)

  • Remove tokens with a hesitation (these will affect the production of the vowel)

  • Remove tokens where the word is not transcribed

  • Remove tokens from a list of stopwords (these are high-frequency grammatical words that may not be representative of a speaker's vocalic productions)

#create a list of stopwords
stopWords <- c('a', 'ah', 'ahh', 'am', "an'", 'an', 'and', 'are', "aren't", 'as', 'at', 'aw', 'because', 'but', 'could', 'do', "don't", 'eh', 'for', 'from', 'gonna', 'had', 'has', 'have', 'he', "he's", 'her', 'high', 'him', 'huh', 'I', "I'll", "I'm", "I've", "I'd", 'in', 'into', 'is', 'it', "it's", 'its', 'just', 'mean', 'my', 'nah', 'not', 'of', 'oh', 'on', 'or', 'our', 'says', 'she', "she's", 'should', 'so', 'than', 'that', "that's", 'the', 'them', 'there', "there's", 'they', 'this', 'to', 'uh', 'um', 'up', 'was', "wasn't", 'we', 'were', 'what', 'when', 'which', 'who', 'with', 'would', 'yeah', 'you', "you've")

#apply the filtering
vowels_all <- vowels_all %>%
  
  filter(
         #filter missing gender and participant year of birth
         !is.na(Gender),
         !is.na(participant_year_of_birth),
         
         #filter word lists
         !grepl("-WL", Transcript),
         !Transcript %in% c("myp94-21a-10.trs", "mop96-21a-09.trs", "mon96-18-11.trs", "mop95-4a-07.trs", "myn96-15a-10.trs", "fop97-8-12.trs", "fyp96-19-08.trs", "myn94-27b-08.trs", "myn96-8a-06.trs", "myp95-23b-02.trs", "myp96-4-08.trs"),
         
         #filter tokens with an error or missing F1/F2
         is.na(Error) |
         !is.na(F1_50) |
         !is.na(F2_50),
         
         #filter tokens where F1_50 has 4 or more characters: this indicates they are either i. an error string output from Praat or ii. an F1 greater than 1000 Hz
         str_length(F1_50) < 4,
         
         #filter tokens with very short or long vowel durations
         VowelDur >= 0.01,
         VowelDur <= 3,
         
         #filter tokens with a very long phonological length
         phonological_length < 25,
         
         #filter tokens with a hesitation (marked in the transcript with a ~)
         !grepl("~", Word),
         
         #filter tokens which do not have the word transcribed
         !is.na(Word),
         
         #filter stopwords
         !Word %in% stopWords
         
         ) %>%
  
  #ensure the F1/F2 variables are numeric as any error strings are now removed
  mutate(F1_50 = as.numeric(as.character(F1_50)),
         F2_50 = as.numeric(as.character(F2_50)))
## filter: removed 983,209 rows (49%), 1,037,654 rows remaining
## mutate: converted 'F1_50' from character to double (0 new NA)
##         converted 'F2_50' from character to double (0 new NA)
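As a side note, the string-length test above works because F1_50 is still a character column at this point: assuming the formant values are stored as whole numbers, well-formed values below 1000 Hz have at most 3 characters, while four-digit values and Praat error strings are longer. A toy illustration (the error string shown is a hypothetical example; base `nchar()` behaves like stringr's `str_length()` here):

```r
#F1_50 is character data at this stage, so string length separates
#plausible values from errors and from values >= 1000 Hz
nchar("823")            #3 characters: kept by str_length(F1_50) < 4
nchar("1234")           #4 characters: removed
nchar("--undefined--")  #13 characters: removed
```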

Speech rate and stress

Purpose: add speech rate and stress coding, filtering out unstressed tokens

As the original dataset did not include each speaker's speech rate for each transcript or the stress coding of the tokens, we have to add this information to the dataset. Additionally, as unstressed tokens are produced differently from stressed tokens, we chose to remove all SCHWA and unstressed tokens from the dataset. There is additional data loss because some tokens do not have stress coded; when this is the case, they are also removed.

Speech rate:

#load in the speech rate data and modify the variable names
vowels_all_time_sr <- readRDS("Filtering_data/ONZE_speech_rates.rds") %>%
  
  #filter tokens without speech rate information or the transcript is not present in the dataset
  filter(!is.na(Speech_rate),
         Transcript %in% vowels_all$Transcript)
## filter: removed 811 rows (20%), 3,156 rows remaining
#add the speech rate variable to the dataset
vowels_all <- vowels_all %>%
  left_join(., vowels_all_time_sr)
## Joining, by = c("Transcript", "Corpus")
## left_join: added one column (Speech_rate)
##            > rows only in x           0
##            > rows only in y  (        0)
##            > matched rows     1,037,654
##            >                 ===========
##            > rows total       1,037,654

Stress:

#load in the stress data
vowels_stress <- readRDS("Filtering_data/ONZE_stress.rds")

#remove SCHWA tokens
vowels_all2 <- vowels_all %>%
  filter(Vowel != "SCHWA") %>%
  left_join(., vowels_stress)
## filter: removed 175,121 rows (17%), 862,533 rows remaining
## Joining, by = c("Speaker", "Transcript", "Corpus", "Line", "LineEnd", "MatchId", "TargetId", "URL", "Text")
## left_join: added 16 columns (SearchName, Number, Before Match, After Match, Target stress, …)
##            > rows only in x      33,228
##            > rows only in y  (1,192,435)
##            > matched rows       829,305
##            >                 ===========
##            > rows total         862,533
#create list of words with stress coding
vowels_all_present_stress <- vowels_all2 %>%
  filter(!is.na(`Target stress`),
         `Target stress` != "0") %>%
  select(WordSegments, `Target segments`, `Target stress`) %>%
  distinct()
## filter: removed 386,312 rows (45%), 476,221 rows remaining
## select: dropped 41 variables (TokenNum, Speaker, Transcript, Corpus, Gender, …)
## distinct: removed 457,385 rows (96%), 18,836 rows remaining
#create list of words missing stress coding
vowels_all_missing_stress <- vowels_all2 %>%
  filter(is.na(`Target stress`),
         `Match segments` %in% vowels_all_present_stress$WordSegments) %>%
  mutate(`Target stress` = "stress")
## filter: removed 708,544 rows (82%), 153,989 rows remaining
## mutate: converted 'Target stress' from factor to character (153989 fewer NA)
#add in tokens with stress
vowels_all <- vowels_all2 %>%
  filter(!is.na(`Target stress`),
         `Target stress` != "0") %>%
  rbind(vowels_all_missing_stress)
## filter: removed 386,312 rows (45%), 476,221 rows remaining
#clean up the data
vowels_all <- vowels_all %>%
  select(TokenNum:Speech_rate, `Target stress`)
## select: dropped 15 variables (SearchName, Number, Before Match, After Match, Target stress start, …)
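In other words, a token with missing stress coding is kept (and labelled "stress") only when the same phonological word form occurs with stress coding elsewhere in the data. A toy illustration of that lookup, with hypothetical word forms (not segments from the ONZE data):

```r
#word forms that have stress coding somewhere in the data
present <- c("kat", "dres")

#tokens with missing stress coding are rescued only if their
#word form appears in that list; otherwise they are dropped
tokens <- data.frame(WordSegments = c("kat", "zzz"),
                     stress = c(NA, NA))
rescued <- tokens[tokens$WordSegments %in% present, ]
rescued$stress <- "stress"
```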

Token Outlier removal

Purpose: filter out outliers

As the data have been force-aligned and the F1/F2 values have not undergone hand correction, we implement an outlier removal step so that values that are likely to be errors are filtered from the dataset. We do this as follows:

  • Calculate the mean and standard deviation for F1 and F2, per vowel and per speaker

  • Calculate a min and max threshold for F1 and F2 values, based on the mean + or - 2.5 standard deviations

  • Determine if a token is smaller or larger than the threshold, in which case it is classified as an outlier and it is removed from the dataset

#outlier removal
sd_limit = 2.5

#calculate the summary statistics required for the outlier removal
vowels_all_summary <- vowels_all %>%
  group_by(Speaker, Vowel) %>%
  summarise(n = n(),
            mean_F1 = mean(F1_50, na.rm = TRUE),
            mean_F2 = mean(F2_50, na.rm = TRUE),
            sd_F1 = sd(F1_50, na.rm = TRUE),
            sd_F2 = sd(F2_50, na.rm = TRUE),
            max_F1 = mean(F1_50) + sd_limit*(sd(F1_50)),
            min_F1 = mean(F1_50) - sd_limit*(sd(F1_50)),
            max_F2 = mean(F2_50) + sd_limit*(sd(F2_50)),
            min_F2 = mean(F2_50) - sd_limit*(sd(F2_50)))
## group_by: 2 grouping variables (Speaker, Vowel)
## summarise: now 6,511 rows and 11 columns, one group variable remaining (Speaker)
#store the outlier tokens data
outlier_tokens <- vowels_all %>%
  inner_join(., vowels_all_summary) %>%
  mutate(outlier = ifelse(F1_50 > min_F1 &
           F1_50 < max_F1 &
           F2_50 > min_F2 &
           F2_50 < max_F2, FALSE, TRUE)) %>%
  group_by(Speaker, Vowel) %>%
  filter(outlier == TRUE) %>%
  ungroup() %>%
  select(TokenNum:`Target stress`)
## Joining, by = c("Speaker", "Vowel")
## inner_join: added 9 columns (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
##             > rows only in x  (      0)
##             > rows only in y  (      0)
##             > matched rows     630,210
##             >                 =========
##             > rows total       630,210
## mutate: new variable 'outlier' with 3 unique values and <1% NA
## group_by: 2 grouping variables (Speaker, Vowel)
## filter (grouped): removed 603,324 rows (96%), 26,886 rows remaining
## select: dropped 10 variables (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
#add the summary statistics and filter out outliers
vowels_all <- vowels_all %>%
  inner_join(., vowels_all_summary) %>%
  mutate(outlier = ifelse(F1_50 > min_F1 &
           F1_50 < max_F1 &
           F2_50 > min_F2 &
           F2_50 < max_F2, FALSE, TRUE)) %>%
  group_by(Speaker, Vowel) %>%
  filter(outlier == FALSE) %>%
  ungroup() %>%
  select(TokenNum:`Target stress`)
## Joining, by = c("Speaker", "Vowel")
## inner_join: added 9 columns (n, mean_F1, mean_F2, sd_F1, sd_F2, …)
##             > rows only in x  (      0)
##             > rows only in y  (      0)
##             > matched rows     630,210
##             >                 =========
##             > rows total       630,210
## mutate: new variable 'outlier' with 3 unique values and <1% NA
## group_by: 2 grouping variables (Speaker, Vowel)
## filter (grouped): removed 26,891 rows (4%), 603,319 rows remaining
## select: dropped 10 variables (n, mean_F1, mean_F2, sd_F1, sd_F2, …)

Inspecting the token counts per vowel highlights that there are considerably fewer FOOT tokens in the dataset compared to the other vowels. As we require speakers to have as large an inventory of tokens as possible for all vowels, we decided to remove all FOOT tokens from the dataset, in order to minimise the likelihood of some speakers having a small token count for a specific vowel.

#get token counts per vowel
vowels_all %>%
  group_by(Vowel) %>%
  summarise(n = n()) %>%
  arrange(n)
## group_by: one grouping variable (Vowel)
## summarise: now 11 rows and 2 columns, ungrouped
#filter out FOOT tokens
vowels_all <- vowels_all %>%
  filter(Vowel != "FOOT")
## filter: removed 17,241 rows (3%), 586,078 rows remaining

Speaker outlier removal

Purpose: filter out speakers who have low quality data

As there may be some speakers whose alignment is particularly poor, we want to check the mean distance between each speaker's vowels. This is achieved by calculating the mean euclidean distance between all of the mean vowels in F1/F2 space. This essentially raises the question: are there speakers in the dataset whose vowel spaces are unnaturally overlapping, i.e. is it the case for any given speaker that the mean F1/F2 values for all their vowels fall within a very small overlapping space? If we do find such speakers, this may indicate issues with their automatic alignment and, thus, that their F1/F2 values are dramatically unreliable.

Note that these speakers would not have had reliable outlier removal in the previous section (i.e. removing tokens that were +/- 2.5 SDs from the mean, calculated per speaker, per vowel), as the mean values used in those calculations would not have been reliable in the first instance for speakers with very small euclidean distances between their mean F1/F2 values.

We then remove speakers who are more than 2 standard deviations below the mean euclidean distance, as these represent relatively overlapped vowel spaces and we define them to be outliers relative to the rest of the speakers. Doing this identifies 12 speakers, who are subsequently removed from the dataset.

To calculate the metric, we will do the following:

  1. Calculate the mean F1/F2 values per speaker, per vowel

  2. Create a euclidean distance matrix based on the distances between the vowels, i.e. each speaker will have an 11x11 matrix based on the distance in F1/F2 space between each pair of vowels; we would expect FLEECE and START to have a large distance, but STRUT and START to have a small distance

  3. For each speaker, take the mean of the distances per vowel, i.e. the column mean for each vowel in the distance matrix; this gives the average distance of a vowel from the other vowels

  4. For each speaker, take the mean of all of the vowel distances; this gives the average distance between all vowels for a given speaker. If this value is small, all vowels are very close to one another; if it is large, they are sparsely distributed

  5. We can now use these values to inspect the distribution of speaker distances. The resulting plot has the mean euclidean distance on the x axis and the kernel density estimate on the y axis (essentially a smoothed histogram); the solid vertical line represents the mean, and the dashed vertical lines represent +/- 2 standard deviations from the mean

  6. Remove the speakers who are more than 2 standard deviations below the mean euclidean distance; this filters out speakers with particularly overlapping vowel productions
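The distance computation in steps 2-4 can be previewed with a toy example using three made-up vowel means (hypothetical values, not from the ONZE data). Note that taking column means over the full distance matrix includes each vowel's zero self-distance, which matches the behaviour of the pipeline code:

```r
#toy speaker: mean F1/F2 for three vowels (hypothetical values)
means <- data.frame(F1 = c(300, 700, 650),
                    F2 = c(2300, 1100, 1300))

d <- as.matrix(dist(means))  #step 2: pairwise euclidean distance matrix
per_vowel <- colMeans(d)     #step 3: average distance of each vowel to the rest
overall <- mean(per_vowel)   #step 4: one summary value for the speaker
```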

#calculate speaker means and sd
speaker_means <- vowels_all %>% #use the vowels_all data
  group_by(Speaker, Vowel) %>% #group based on per speaker, per vowel
  summarise(n = n(),
            F1_mean = mean(F1_50),
            F2_mean = mean(F2_50),
            F1_sd = sd(F1_50),
            F2_sd = sd(F2_50)) #get the mean F1/F2
## group_by: 2 grouping variables (Speaker, Vowel)
## summarise: now 5,920 rows and 7 columns, one group variable remaining (Speaker)
#calculate euclidean distances between vowel means
speaker_distances <- speaker_means %>%
  mutate(Dist = #calculate the euclidean distance matrix between the vowel means for each speaker
           colMeans(as.matrix(dist(cbind(F1_mean, F2_mean))))) %>%
  ungroup() %>% #ungroup the speakers and vowels
  group_by(Speaker) %>% #group by just speaker
  summarise(mean_dist = mean(Dist), #calculate the mean distance across all of a speaker's vowels
            sd_dist = sd(Dist)) %>% #calculate the sd too
  mutate(Speaker_dist = paste(round(mean_dist, 2), Speaker)) #create a new variable with the mean euclidean distance and the speaker name for plotting
## mutate (grouped): new variable 'Dist' with 5,920 unique values and 0% NA
## group_by: one grouping variable (Speaker)
## summarise: now 592 rows and 3 columns, ungrouped
## mutate: new variable 'Speaker_dist' with 592 unique values and 0% NA
#plot the distribution
ggplot(speaker_distances, aes(x = mean_dist)) + #plot the distribution
  geom_density() +
  geom_vline(xintercept = mean(speaker_distances$mean_dist), linetype = 1) +
  geom_vline(xintercept = mean(speaker_distances$mean_dist) + 2*sd(speaker_distances$mean_dist), linetype = 2) +
  geom_vline(xintercept = mean(speaker_distances$mean_dist) - 2*sd(speaker_distances$mean_dist), linetype = 2) +
  theme_bw()

#filter speakers who are -2 SDs from the mean euclidean distance
outlier_speakers <- speaker_means %>%
  inner_join(., speaker_distances) %>%
  filter(mean_dist < mean(speaker_distances$mean_dist) - 2*sd(speaker_distances$mean_dist))
## Joining, by = "Speaker"
## inner_join: added 3 columns (mean_dist, sd_dist, Speaker_dist)
##             > rows only in x  (    0)
##             > rows only in y  (    0)
##             > matched rows     5,920
##             >                 =======
##             > rows total       5,920
## filter (grouped): removed 5,780 rows (98%), 140 rows remaining
#plot the outlier speakers vowel spaces
outlier_speakers_plot <- vowels_all %>%
  filter(Speaker %in% outlier_speakers$Speaker) %>%
  inner_join(., outlier_speakers) %>%
  arrange(mean_dist) %>%
  ggplot(aes(x = F2_50, y = F1_50, colour = Vowel)) +
  geom_point(size = 0.05, alpha = 0.5) +
  stat_ellipse(level = 0.67) +
  geom_text(data = outlier_speakers, aes(x = F2_mean, y =  F1_mean, label = Vowel)) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme_bw() +
  theme(legend.position = "none")
## filter: removed 574,234 rows (98%), 11,844 rows remaining
## Joining, by = c("Speaker", "Vowel")
## inner_join: added 8 columns (n, F1_mean, F2_mean, F1_sd, F2_sd, …)
##             > rows only in x  (     0)
##             > rows only in y  (     0)
##             > matched rows     11,844
##             >                 ========
##             > rows total       11,844
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

#store the outlier speaker data
outlier_speakers_data <- vowels_all %>%
  filter(Speaker %in% outlier_speakers$Speaker)
## filter: removed 574,234 rows (98%), 11,844 rows remaining
#filter out the speakers from the dataset
vowels_all <- vowels_all %>%
  filter(!Speaker %in% outlier_speakers$Speaker)
## filter: removed 11,844 rows (2%), 574,234 rows remaining

Further inspection of the data reveals that there are considerably more speakers whose vowel spaces suggest issues with the raw data. Thus, individual vowel plots and density distributions for F1/F2 were generated for each of the speakers in the dataset. These can be found in the Speaker_vowel_plots folder of the repository for all speakers in the data (referring to the data at this stage of the filtering process).

Based on this inspection, we identified a further 76 speakers whose vowel spaces appeared to have substantially low quality data for F1/F2 values. These are plotted below for inspection.

Additional exploration of these speakers suggests that there is large variation in the F2 measurements, with values having substantially more spread than for those speakers not identified as being low quality. This indicates that the outlier filtering of +/- 2 standard deviations applied to the raw data may not be sufficient to remove unreliable measurements. Alternatively, it could be explained by the visual nature of the inspection process, with the inspection being biased towards identifying speakers with more variable F2 across certain vowels.

Whilst manually inspecting the vowel spaces and distributions may not be an optimal solution for identifying speakers with low quality data, it does highlight underlying issues with automatic extraction of formant values. Furthermore, upon listening to the recordings of some of the low quality speakers, some of these issues can be explained simply by recording quality. Given that our dataset is substantially large (for sociophonetic research), we found that a compromise between automatic extraction and manual inspection resulted in a dataset that was both large enough and of sufficiently high quality for our analyses.

Note: if you are running this document within R, producing and saving the plots can take over 10 minutes.

#generate individual vowel plots for all speakers and density plots of the F1/F2 distributions
cat("Processing:\n")

for (i in levels(factor(vowels_all$Speaker))) {

cat(paste(i))
  
#density plot for F1
density_F1 <- vowels_all %>%
  dplyr::filter(Speaker == i) %>%
  pivot_longer(cols = F1_50:F2_50, names_to = "F1", values_to = "F_value") %>%
  dplyr::filter(F1 == "F1_50") %>%
  ggplot(aes(x = F_value, colour = factor(Vowel))) +
  geom_density() +
  facet_grid(F1~Vowel) +
  theme_bw() +
  theme(legend.position = "none", axis.text.x = element_text(size = 5))
  
#density plot for F2
density_F2 <- vowels_all %>%
  dplyr::filter(Speaker == i) %>%
  pivot_longer(cols = F1_50:F2_50, names_to = "F2", values_to = "F_value") %>%
  dplyr::filter(F2 == "F2_50") %>%
  ggplot(aes(x = F_value, colour = factor(Vowel))) +
  geom_density() +
  scale_x_reverse() +
  facet_grid(F2~Vowel) +
  theme_bw() +
  theme(legend.position = "none", axis.text.x = element_text(size = 5))

#vowel space plot  
vowel_space <- vowels_all %>%
    dplyr::filter(Speaker == i) %>%
    ggplot(aes(x = F2_50, y = F1_50, colour = Vowel)) +
    geom_point(size = 0.05, alpha = 0.4) +
    stat_ellipse(level = 0.67) +
    geom_text(data = speaker_means %>% dplyr::filter(Speaker == i), aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
    scale_x_reverse() +
    scale_y_reverse() +
    facet_wrap(~Speaker, scales = "free") +
    theme_bw() +
    theme(legend.position = "none")
  
  density_plots <- plot_grid(density_F1, NULL, density_F2, NULL, nrow = 2, ncol = 2, rel_widths = c(5, 3))
  combined_plot <- ggdraw(density_plots + draw_plot(vowel_space, x = 0.6, y = -0.1, scale = 0.8, width = 0.4, height = 1.2))
  
  #save the file
  ggsave(filename = paste0("Speaker_vowel_plots/", i, "_plot.png"), plot = combined_plot, width = 12, height = 4, dpi = 300)
cat(paste("✅"))
}
#list of speakers with low quality vowel plots
speakers_remove <- c(
"fop96-12a", "Ada Aitcheson", "Anna Hayes", "Anne Reed",
"Annette Golding", "Annie Hamilton", "Cap Jardine", "Catherine King",
"chchFY2", "chchMO1", "Christina Bisset", "darfe1",
"darFE2", "darFO1 (duplicate)", "darFO4", "darFY2",
"darFY3", "darFY5", "darME2", "darMY5",
"David Algie", "David Moore", "Dorothy Hagitt", "Edith German",
"Elizabeth Arnott", "Ella Williams", "fon94-3", "fon94-25c",
"fon96-27a", "fon98-7b", "fon98-17b", "fop04-6",
"fop06-8", "fop95-27", "fyn00-4b", "fyn01-18",
"fyn06-7", "fyn07-5a", "fyn94-9b", "fyn94-12a",
"fyn94-22b", "fyn96-6b", "fyn99-22a", "fyp94-26a",
"fyp96-3b", "fyp96-24", "fyp99-13a", "George Goodyear",
"Hannah Cross", "Jane Reid", "Jean Atkinson", "Jessie Drinnan",
"Jim Wilcox", "Joan Wicks", "Joyce Creighton", "Lillian Aitken",
"Lloyd Algie", "Maureen Weir", "Mavis Jackson", "Millie Harris",
"mon06-9", "mon99-1b", "mop06-6", "mop99-9",
"mop99-17a", "myn00-11c", "myn02-16b", "myn05-5a",
"myp04-7", "myp05-4", "myp07-6", "Myra Ralston",
"Nan Hay", "Pauline Grither", "Rupert Pyle", "William Warren"
)

#plot the outlier speakers vowel spaces
low_quality_speakers_plot <- vowels_all %>%
  filter(Speaker %in% speakers_remove) %>%
  inner_join(., speaker_means) %>%
  ggplot(aes(x = F2_50, y = F1_50, colour = Vowel)) +
  geom_point(size = 0.05, alpha = 0.5) +
  stat_ellipse(level = 0.67) +
  geom_text(aes(x = F2_mean, y =  F1_mean, label = Vowel)) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme_bw() +
  theme(legend.position = "none")
## filter: removed 489,344 rows (85%), 84,890 rows remaining
## Joining, by = c("Speaker", "Vowel")
## inner_join: added 5 columns (n, F1_mean, F2_mean, F1_sd, F2_sd)
##             > rows only in x  (     0)
##             > rows only in y  ( 5,160)
##             > matched rows     84,890
##             >                 ========
##             > rows total       84,890

#plot distibutions of F1 and F2 comparing low quality speakers to those not identified as being low quality

#F1
speaker_means %>%
  mutate(removed = Speaker %in% speakers_remove) %>%
  ggplot(aes(x = F1_sd, colour = removed, fill = removed)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~Vowel) +
  ggtitle("F1 sd distributions") +
  theme_bw()
## mutate (grouped): new variable 'removed' with 2 unique values and 0% NA

#F2
speaker_means %>%
  mutate(removed = Speaker %in% speakers_remove) %>%
  ggplot(aes(x = F2_sd, colour = removed, fill = removed)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~Vowel) +
  ggtitle("F2 sd distributions") +
  theme_bw()
## mutate (grouped): new variable 'removed' with 2 unique values and 0% NA

#store the low quality speakers data
low_quality_speakers_data <- vowels_all %>%
  filter(Speaker %in% speakers_remove)
## filter: removed 489,344 rows (85%), 84,890 rows remaining
#filter out the low quality speakers from the dataset
vowels_all <- vowels_all %>%
  filter(!Speaker %in% speakers_remove)
## filter: removed 84,890 rows (15%), 489,344 rows remaining

Final filtering

Purpose: apply additional filtering to the data

In order to ensure that speakers in the dataset have sufficient numbers of tokens for each of the vowels, we next filter out any speaker with < 5 tokens for any of the vowels. The code below calculates the number of tokens per speaker, per vowel, then removes any such speaker from the dataset (including all of their other tokens).

#count number of vowels per speaker and store list of speakers with < 5 tokens in any vowel
low_n_speakers <- vowels_all %>%
  group_by(Speaker, Vowel) %>%
  count() %>%
  ungroup() %>%
  filter(n < 5) %>%
  select(Speaker)
## group_by: 2 grouping variables (Speaker, Vowel)
## count: now 5,020 rows and 3 columns, 2 group variables remaining (Speaker, Vowel)
## filter: removed 5,011 rows (>99%), 9 rows remaining
## select: dropped 2 variables (Vowel, n)
#filter out the speakers with < 5 tokens in any vowel
vowels_all <- vowels_all %>%
  ungroup() %>%
  filter(!Speaker %in% low_n_speakers$Speaker)
## filter: removed 1,997 rows (<1%), 487,347 rows remaining

We also wanted to filter out any tokens that have an l or r as the segment following the vowel being analysed. We therefore queried LaBB-CAT for this information, which is stored in the ONZE_following_segments.rds file. We will load in these data, add the relevant information to our current dataset, and then remove any tokens with a following l or r.

#load in the following segment data
following_segment_data <- readRDS("Filtering_data/ONZE_following_segments.rds") %>%
  rename(following_segment = Token.1..segments) %>%
  mutate(following_segment = as.character(following_segment)) %>%
  select(Speaker:following_segment, -Error, -Target.stress)
## mutate: converted 'following_segment' from factor to character (0 new NA)
## select: dropped 5 variables (TokenNum, Error, Target.stress, Token.1..segments.start, Token.1..segments.end)
#add the following segment information to the dataset, then filter out tokens with a `l` or `r` following the vowel
vowels_all <- vowels_all %>%
  cbind(., following_segment_data[,c("following_segment")]) %>%
  rename(following_segment = `following_segment_data[, c(\"following_segment\")]`) %>%
  filter(!following_segment %in% c("l", "r"))
## filter: removed 58,296 rows (12%), 429,051 rows remaining
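Note that the cbind() step above assumes the two data frames have exactly the same row order. A toy demonstration of an order-robust alternative, joining on a key column instead (the MatchId values and word forms here are hypothetical, and base merge() is used so the example is self-contained):

```r
#two tiny hypothetical token tables whose rows are NOT in the same order
tokens <- data.frame(MatchId = c("m1", "m2", "m3"),
                     Word    = c("fleet", "trap", "start"))
segments <- data.frame(MatchId = c("m3", "m1", "m2"),
                       following_segment = c("t", "t", "l"))

#a cbind() here would mispair rows; merging on MatchId pairs each token
#with its own following segment regardless of row order
joined <- merge(tokens, segments, by = "MatchId")
joined <- joined[!joined$following_segment %in% c("l", "r"), ]
```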

Save the dataset

Purpose: export the filtered data for the analysis script

Now that the data filtering steps have been carried out, the final dataset is ready to be saved and used in the analysis file. We will store this file in the data folder of the repository.

vowels_all <- vowels_all %>%
  arrange(Speaker) %>%
  mutate_if(is.character, as.factor)
## mutate_if: converted 'Speaker' from character to factor (0 new NA)
##            converted 'Transcript' from character to factor (0 new NA)
##            converted 'Corpus' from character to factor (0 new NA)
##            converted 'Gender' from character to factor (0 new NA)
##            converted 'MatchId' from character to factor (0 new NA)
##            converted 'TargetId' from character to factor (0 new NA)
##            converted 'URL' from character to factor (0 new NA)
##            converted 'PrevWord' from character to factor (0 new NA)
##            converted 'Text' from character to factor (0 new NA)
##            converted 'FollWord' from character to factor (0 new NA)
##            converted 'Word' from character to factor (0 new NA)
##            converted 'WordSegments' from character to factor (0 new NA)
##            converted 'Vowel' from character to factor (0 new NA)
##            converted 'VowelDISC' from character to factor (0 new NA)
##            converted 'Error' from character to factor (0 new NA)
# write.csv(vowels_all, "Data/ONZE_vowels_filtered.csv", row.names = FALSE)
saveRDS(vowels_all, "Data/ONZE_vowels_filtered.rds")